ABSTRACT



We consider the problem of learning co-occurrence information
between two word categories, or more generally between two
discrete random variables taking values in a hierarchically
classified domain. In particular, we consider the problem of
learning the 'association norm' defined by
$A(x, y) = p(x, y) / p(x)p(y)$, where $p(x, y)$ is the joint
distribution for $x$ and $y$, and $p(x)$ and $p(y)$ are the marginal
distributions induced by $p(x, y)$. We formulate this problem as a
sub-task of learning the conditional distribution $p(x \mid y)$, by
exploiting the identity $p(x \mid y) = A(x, y) \cdot p(x)$. We
propose a two-step estimation method based on the MDL principle,
which works as follows: it first estimates $p(x)$ as $\hat{p}$ using
MDL, and then estimates $p(x \mid y)$ for a fixed $y$ by applying MDL
on the hypothesis class $\{A \cdot \hat{p} \mid A \in \mathcal{A}\}$
for some given class $\mathcal{A}$ of representations for the
association norm. The estimation of $A$ is therefore obtained as a
side effect of a near optimal estimation of $p(x \mid y)$. We then
apply this general framework to the problem of acquiring case-frame
patterns, an important task in corpus-based natural language
processing. We assume that both $p(x)$ and $A(x, y)$ for given $y$
are representable by a model based on a classification that exists
within an existing thesaurus tree as a 'cut,' and hence
$p(x \mid y)$ is represented as the product of a pair of 'tree cut
models.' We then devise an efficient algorithm that implements our
general strategy. We tested our method by using it to acquire
case-frame patterns, and conducted syntactic disambiguation
experiments using the acquired knowledge. The experimental results
show that our method improves upon existing methods.

Keywords: Unsupervised learning, Learning association norm, MDL
estimation.



1 Introduction 

A central issue in natural language processing is that of ambiguity
resolution in syntactic parsing, and it is generally acknowledged
that a certain amount of semantic knowledge is required for this. In
particular, the case frames of verbs, namely the knowledge of which
nouns are allowed at given case slots of given verbs, are crucial
for this purpose. Such knowledge is not available in existing
dictionaries in a satisfactory form, and hence the problem of
automatically acquiring such knowledge from large corpus data has
become an important topic in natural language processing and
machine learning (cf. [PTL92, ALN94, LA95]). In this paper, we
propose a new method of learning such knowledge, and empirically
demonstrate its effectiveness.

The knowledge of case slot patterns can be thought of as the
co-occurrence information between verbs and nouns¹ at a fixed case
slot, such as at the subject position. In this paper, we employ the
following quantity as a measure of co-occurrence (called the
'association norm'):

\[ A(n, v) = \frac{p(n, v)}{p(n)\, p(v)} \tag{1} \]

where $p(n, v)$ denotes the joint distribution over the nouns and
the verbs (over $N \times V$), and $p(n)$ and $p(v)$ the marginal
distributions over $N$ and $V$ induced by $p(n, v)$, respectively.
Since $A(n, v)$ is obtained by dividing the joint probability of $n$
and $v$ by their respective marginal probabilities, it is
intuitively clear that it measures the degree of co-occurrence
between $n$ and $v$. This quantity is essentially the same as a
measure proposed in the context of natural language processing by
Church and Hanks [CH89] called the 'association ratio,' which can be
defined as $I(n, v) = \log A(n, v)$. Note that $I(n, v)$ is the
quantity referred to as 'self mutual information' in information
theory, whose expectation with respect to $p(n, v)$ is the
well-known 'mutual information' between random variables $n$ and
$v$. The learning problem we are considering, therefore, is in fact
a very general and important problem with many potential
applications.

Real World Computing Partnership

1 We are interested in the co-occurrence information between any
two word categories, but in much of the paper we assume that it is
between nouns and verbs to simplify our discussion.

A question that immediately arises is whether the association norm
as defined above is the right measure to use for the purpose of
ambiguity resolution. Below we will demonstrate why this is indeed
the case. Consider the sentence, 'the sailor smacked the postman
with a bottle.' The ambiguity in question is between 'smacked ...
with a bottle' and 'the postman with a bottle.' Suppose we take the
approach of comparing conditional probabilities,
$p_{inst}(smack \mid bottle)$ and $p_{poss}(postman \mid bottle)$,
as in some past research [LA95]. (Here we let $p_{case}$, in
general, denote the joint/conditional probability distribution over
two word categories at the case slot denoted by $case$.) Then, since
the word 'smack' is such a rare word, it is likely that we will have
$p_{inst}(smack \mid bottle) < p_{poss}(postman \mid bottle)$, and
conclude as a result that the 'bottle' goes with the 'postman.'
Suppose on the other hand that we compare $A_{inst}(smack, bottle)$
and $A_{poss}(postman, bottle)$. This time we are likely to have
$A_{inst}(smack, bottle) > A_{poss}(postman, bottle)$, and conclude
that the 'bottle' goes with 'smack,' giving the intended reading of
the sentence. The crucial fact here is that the two words 'smack'
and 'postman' have occurred in the sentence of interest, and what
we are interested in comparing is the respective likelihood that
two words co-occurred at two different case slots
(possessive/instrumental), given that the two words have occurred.
It therefore makes sense to compare the joint probability divided by
the respective marginal probabilities, namely
$A(n, v) = p(n, v) / p(n)p(v)$.

If one employed $p(n \mid v)$ as the measure of co-occurrence, its
learning problem, for a fixed verb $v$, would reduce to that of
learning a simple distribution. In contrast, as $A(n, v)$ does not
define a distribution, it is not immediately clear how we should
formulate its estimation problem. In order to resolve this issue,
we make use of the following identity:

\[ p(n \mid v) = \frac{p(n, v)}{p(v)} = \frac{p(n, v)}{p(n)\, p(v)} \cdot p(n) = A(n, v) \cdot p(n). \tag{2} \]

In other words, $p(n \mid v)$ can be decomposed into the product of
the association norm and the marginal distribution over $N$. Now,
since $p(n)$ is simply a distribution over the nouns, it can be
estimated with an ordinary method of density estimation. (We let
$\hat{p}(n)$ denote the result of such an estimation.) It is worth
noting here that for this estimation, even when we are estimating
$p(n \mid v)$ for a particular verb $v$, we can use the entire
sample for $N \times V$. We can then estimate $p(n \mid v)$, using
as hypothesis class
$H(\hat{p}) = \{A(n, v) \cdot \hat{p}(n) \mid A \in \mathcal{A}\}$,
where $\mathcal{A}$ is some class of representations for the
association norm $A(n, v)$. Again, for a fixed verb, this is a
simple density estimation problem, and can be done using any of the
many well-known estimation strategies. In particular, we propose
and employ a method based on the MDL (Minimum Description Length)
principle [Ris78, QR89], thus guaranteeing a near optimal estimation
of $p(n \mid v)$ [Yam92]. As a result, we will obtain a model for
$p(n \mid v)$, expressed as a product of $\hat{A}(n, v)$ and
$\hat{p}$, thus giving an estimation for the association norm
$A(n, v)$ as a side effect of estimating $p(n \mid v)$.
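The identity (2) can be checked numerically on a toy sample. The following is a minimal sketch that uses plain relative frequencies rather than the MDL-based estimates proposed in this paper; the function name `association_norm` and the six-pair sample are hypothetical and serve only as illustration.

```python
from collections import Counter


def association_norm(pairs):
    """Estimate A(n, v) = p(n, v) / (p(n) p(v)) by relative frequencies."""
    total = len(pairs)
    joint = Counter(pairs)
    noun_counts = Counter(n for n, _ in pairs)
    verb_counts = Counter(v for _, v in pairs)
    norm = {}
    for (n, v), c in joint.items():
        p_nv = c / total
        p_n = noun_counts[n] / total
        p_v = verb_counts[v] / total
        norm[(n, v)] = p_nv / (p_n * p_v)
    return norm, noun_counts, verb_counts, total


# Hypothetical (noun, verb) sample for one case slot.
pairs = [("bird", "fly"), ("bird", "fly"), ("bird", "sing"),
         ("plane", "fly"), ("man", "sing"), ("man", "walk")]
norm, noun_counts, verb_counts, total = association_norm(pairs)

# Identity (2): p(n | v) = A(n, v) * p(n).
p_bird = noun_counts["bird"] / total
p_bird_given_fly = 2 / verb_counts["fly"]
print(norm[("bird", "fly")] * p_bird, p_bird_given_fly)
```

Both printed values agree (up to rounding), which is all the identity asserts; the sketch says nothing about how well such raw estimates generalize.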

It has been noticed in the area of corpus-based natural language
processing that any method that attempts to estimate either a
co-occurrence measure or a probability value for each noun
separately requires far too many examples to be useful in practice.
(This is usually referred to as the data sparseness problem.) In
order to circumvent this difficulty, we proposed in an earlier paper
[LA95] an MDL-based method that estimates $p(n \mid v)$ (for a
particular verb), using a noun classification that exists within a
given thesaurus. That is, this method estimates the noun
distribution in terms of a 'tree cut model,' which defines a
probability distribution by assigning a generation probability to
each category in a 'cut' within a given thesaurus tree.² Thus, the
categories in the cut are used as the 'bins' of a histogram, so to
speak. The use of MDL ensures that an optimal tree cut is selected,
one that is fine enough to capture the tendency in the input data,
but coarse enough to allow the estimation of probabilities of
categories within it with reasonable accuracy. The shortcoming of
the method of [LA95], however, is that it estimates $p(n \mid v)$
but not $A(n, v)$.

2 See Section 2 for a detailed definition of the 'tree cut models.'

In this paper, we apply the general framework of estimating
association norm to this particular problem setting, and propose an
efficient estimation method for $A(n, v)$ based on MDL. More
formally, we assume that the marginal distribution over the nouns
is definable by a tree cut model, and that the association norm
(for each verb) can also be defined by a similar model which
associates an $A$ value with each of the categories in a cut in the
same thesaurus tree (called an 'association tree cut model'), and
hence $p(n \mid v)$ for a particular $v$ can be represented as the
product of a pair of these tree cut models (called a 'tree cut pair
model'). (See Figure 1 (a), (b) and (c) for examples of a 'tree
cut,' a 'tree cut model,' and an 'association tree cut model,' all
in the same thesaurus tree.) We have devised an efficient algorithm
for each of the two steps in the general estimation strategy,
namely, finding an optimal tree cut model for the marginal
distribution $p(n)$ (step 1), and finding an optimal association
tree cut model for $A(n, v)$ for a particular $v$ (step 2). Each
step will select an optimal tree cut in the thesaurus tree, thus
providing appropriate levels of generalization for both $p(n)$ and
$A(n, v)$.

We tested the proposed method in an experiment, in which the
association norms for a number of verbs and nouns are acquired
using WordNet [MBF+93] as the thesaurus and using corpus data from
the Penn Tree Bank as training data. We also performed ambiguity
resolution experiments using the association norms obtained using
our learning method. The experimental results indicate that the new
method achieves better performance than existing methods for the
same task, especially in terms of 'coverage.'³ We found that the
optimal tree cut found for $A(n, v)$ was always coarser (i.e.
closer to the root of the thesaurus tree) than that for
$p(n \mid v)$ found using the method of [LA95]. This, we believe,
contributes directly to the wider coverage achieved by our new
method.

2 The Tree Cut Pair Model 

In this section, we will describe the class of representations we
employ for distributions over nouns as well as the association norm
between nouns and a particular verb.⁴

3 Here 'coverage' refers to the percentage of the test data for
which the method could make a decision.

4 In general, this can be between words of any two categories, but
for ease of exposition, we assume here that it is between nouns and
verbs.

A thesaurus is a tree such that each of its leaf nodes represents a
noun, and its internal nodes represent noun classes.⁵ The class of
nouns represented by an internal node is the set of nouns
represented by leaf nodes dominated by that node. A 'tree cut' in a
thesaurus tree is a sequence of internal/leaf nodes, such that its
members dominate all of the leaf nodes exhaustively and disjointly.
Equivalently, therefore, a tree cut is a set of noun categories /
nouns which defines a partition over the set of all nouns
represented by the leaf nodes of the thesaurus. Now we define the
notion of a 'tree cut model' (or a TCM for short) representing a
distribution over nouns.⁶

Definition 1 Given a thesaurus tree $t$, a 'tree cut model' is a
pair $p = (\tau, q)$, where $\tau$ is a tree cut in $t$, and $q$ is
a parameter vector specifying a probability distribution over the
members of $\tau$.

A tree cut model defines a probability distribution by sharing the
probability of each noun category uniformly among all the nouns
belonging to that category. That is, the probability distribution
$p$ represented by a tree cut model $(\tau, q)$ is given by

\[ \forall C \in \tau \;\; \forall x \in C \quad p(x) = \frac{q(C)}{|C|} \tag{3} \]

A tree cut model can also be represented by a tree, each of whose
leaf nodes is a pair consisting of a noun (category) and a
parameter specifying its (collective) probability. We give an
example of a simple TCM for the category 'ANIMAL' in Figure 1(b).

We similarly define the 'association tree cut model' (or ATCM for
short).

Definition 2 Given a thesaurus tree $t$ and a fixed verb $v$, an
'association tree cut model' (ATCM) $A(\cdot, v)$ is a pair
$(\tau, p)$, where $\tau$ is a tree cut in $t$, and $p$ is a
function from $\tau$ to $\mathbb{R}$.

An association tree cut model defines an association norm by
assigning the same value $A$ of association norm to each noun
belonging to a noun category. That is,

\[ \forall C \in \tau \;\; \forall x \in C \quad A(x, v) = A(C, v) \tag{4} \]

We give an example ATCM in Figure 1(c), which is meant to be an
ATCM for the subject slot of verb 'fly' within the category of
'ANIMAL.'
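Equations (3) and (4) amount to two simple expansions from class-level parameters to per-noun values. A minimal sketch follows; the class names, membership lists and numeric values are hypothetical (they are not the values of Figure 1), and the dictionary-based representation is an assumption made only for illustration.

```python
def expand_tcm(cut, members):
    """Per-noun probabilities p(x) = q(C) / |C| for a tree cut model, Eq. (3)."""
    return {x: q / len(members[c]) for c, q in cut.items() for x in members[c]}


def expand_atcm(cut, members):
    """Per-noun association values A(x, v) = A(C, v), Eq. (4)."""
    return {x: a for c, a in cut.items() for x in members[c]}


# Hypothetical cut {BIRD, INSECT} under ANIMAL.
members = {"BIRD": ["swallow", "crow", "eagle"], "INSECT": ["bug", "bee"]}
tcm = {"BIRD": 0.6, "INSECT": 0.4}     # q(C): class probabilities
atcm = {"BIRD": 1.5, "INSECT": 0.25}   # A(C, v): class association values

p = expand_tcm(tcm, members)   # every noun receives an equal share of its class
A = expand_atcm(atcm, members) # every noun inherits its class value
print(p["crow"], A["crow"])
```

Note the asymmetry: probabilities are divided among class members, whereas association values are copied unchanged.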

We then define the notion of a 'tree cut pair model,' which is a
model for $p(n \mid v)$ for some fixed verb $v$.

5 This condition is not strictly satisfied by most of the publicly
available thesauruses, but we make this assumption to simplify the
subsequent discussion.

6 This definition essentially follows that given by Li and

Figure 1: (a) a tree cut, (b) a TCM $p$, (c) an ATCM $A$, (d) the
distribution of $h = A \cdot p$

Definition 3 A 'tree cut pair model' $h$ is a pair $h = (A, p)$,
where $A$ is an association tree cut model (for a certain verb
$v$), and $p$ is a tree cut model (for $N$), which satisfies the
stochastic condition, namely,

\[ \sum_{n \in N} A(n, v) \cdot p(n) = 1. \tag{5} \]

The above stochastic condition ensures that $h$ defines a legal
distribution $h(n \mid v)$. An example of a tree cut pair model is
the pair consisting of the models of Figure 1(b) and (c), which
together define the distribution shown in Figure 1(d), verifying
that it in fact satisfies the stochastic condition (5).

3 A New Method of Estimating Association Norms

As described in the Introduction, our estimation procedure consists
of two steps: the first step is for estimating $p$, and the second
for estimating $A$ given an estimation $\hat{p}$ for $p$. The first
step can be performed by an estimation method for tree cut models
proposed by the authors in [LA95], which is related to 'Context' of
Rissanen [Ris83]. This method, called 'Find-MDL,' is an efficient
implementation of the MDL principle for the particular class of
tree cut models, and will be exhibited for completeness, as a
sub-procedure of the entire estimation algorithm.

Having estimated $p$ by Find-MDL using the entire sample $S$ (we
write $\hat{p}$ for the result of this estimation), we will then
estimate $A$. As explained in the Introduction, we will use as the
hypothesis class for this estimation
$H(\hat{p}) = \{A(n, v) \cdot \hat{p}(n) \mid A \in \mathcal{A}(t)\}$,
where $\mathcal{A}(t)$ is the set of ATCMs for the given thesaurus
tree $t$, and select, according to the MDL principle, a member of
$H(\hat{p})$ that best explains the part of the sample that
corresponds to the verb $v$, written $S_v$. That is, the result of
the estimation, $\hat{h}$, is to be given by⁷



\[ \hat{h} = \arg\min_{h \in H(\hat{p})} \Big( d.l.(h) + \sum_{n \in S_v} -\log h(n \mid v) \Big). \tag{6} \]



In the above, we used 'd.l.(h)' to denote the model description
length of $h$, and as is well-known,
$\sum_{n \in S_v} -\log h(n \mid v)$ is the data description length
for sample $S_v$ with respect to $h$. Since the model description
length of $\hat{p}$ is fixed, we only need to consider the model
description length of $A$, which consists of two parts: the
description length for the tree cut, and that for the parameters.
We assume that we employ the 'uniform' coding scheme for the tree
cuts, that is, all the tree cuts have exactly the same description
length. Thus, it suffices to consider just the parameter
description length for the purpose of minimization. The description
length for the parameters is calculated as
$(par(A)/2) \log |S_v|$, where $par(A)$ denotes the number of free
parameters in the tree cut of $A$. Using $(1/2) \log |S_v|$ bits per
parameter is known to be asymptotically optimal, since the standard
deviation of the estimate is of the order $1/\sqrt{|S_v|}$. Note
here that we use $(\log |S_v|)/2$ bits and not $(\log |S|)/2$, since
the numerator $\hat{h}$ of $\hat{A}$ is estimated using $S_v$, even
though the denominator $\hat{p}$ is estimated using the entire
sample $S$. The reason is that the estimation error for $\hat{A}$,
provided that we assume $\hat{p}(C) > \epsilon$ for a reasonable
constant $\epsilon$, is dominated by the estimation error for
$\hat{h}$.

7 All logarithms in this paper are to the base 2.



Now, since we have $h(n \mid v) = A(n, v) \cdot \hat{p}(n)$ by
definition, the data description length can be decomposed into the
following two parts:

\[ \sum_{n \in S_v} -\log h(n \mid v) = \sum_{n \in S_v} -\log A(n, v) + \sum_{n \in S_v} -\log \hat{p}(n) \tag{7} \]

Notice here that the second term does not depend on the choice of
$A$, and hence for the minimization in (6), it suffices to consider
just the first term, $\sum_{n \in S_v} -\log A(n, v)$. From this and
the preceding discussion on the model description length, (6)
yields:

\[ \hat{h} = \arg\min_{h \in H(\hat{p})} \Big( \frac{par(A)}{2} \log |S_v| + \sum_{n \in S_v} -\log A(n, v) \Big) \tag{8} \]
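The objective in (8) is straightforward to evaluate for a candidate cut once class-level counts are available. The sketch below is illustrative only: the function name, the dictionary inputs and the numbers are hypothetical, and it assumes that $par(A)$ equals the number of classes in the cut, which the text does not state explicitly.

```python
import math


def description_length(assoc, counts):
    """Objective of Eq. (8): (par(A)/2) log2 |S_v| + sum over S_v of -log2 A(n, v).

    assoc:  maps each class C in the cut to its value A(C, v)
    counts: maps each class C in the cut to #(C, S_v)
    ASSUMPTION: par(A) is taken to be the number of classes in the cut.
    """
    size = sum(counts.values())                       # |S_v|
    param_cost = len(assoc) / 2 * math.log2(size)
    # Every noun in class C contributes -log2 A(C, v), by Eq. (4).
    data_cost = sum(-k * math.log2(assoc[c]) for c, k in counts.items() if k > 0)
    return param_cost + data_cost


# Hypothetical two-class cut observed 8 times with verb v.
assoc = {"BIRD": 2.0, "INSECT": 0.5}
counts = {"BIRD": 6, "INSECT": 2}
dl = description_length(assoc, counts)
print(dl)
```

The value can be negative, since the constant term $\sum_{n \in S_v} -\log \hat{p}(n)$ of (7) has been dropped; only differences between candidate cuts matter.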



We will now describe how we calculate the data description length
for a tree cut pair model $h = (A, \hat{p})$. The data description
length given a fixed tree cut is calculated using the maximum
likelihood estimation (MLE) for $h(n \mid v)$, i.e. by maximizing
the likelihood $L(h, S_v) = \prod_{n \in S_v} h(n \mid v)$. Since in
general the tree cut of $A$ does not coincide with the tree cut of
$\hat{p}$, this maximization problem appears somewhat involved. The
following lemma, however, establishes that it can be solved
efficiently.

Lemma 1 Given a tree cut model $\hat{p} = (\sigma, q)$ and a tree
cut $\tau$, the MLE (maximum likelihood estimate)
$\hat{h} = \hat{A} \cdot \hat{p}$ is given by setting
$\hat{h}(C' \mid v)$ for each $C' \in \tau$ by

\[ \hat{h}(C' \mid v) = \frac{\#(C', S_v)}{|S_v|} \]

where in general we let $\#(C, S)$ denote the number of occurrences
of nouns belonging to class $C$ in sample $S$. The estimate for
$\hat{A}$ is then given by letting, for each $C' \in \tau$,

\[ \hat{A}(C', v) = \frac{\hat{h}(C' \mid v)}{\hat{p}(C')} \]

where $\hat{p}(C')$ is defined inductively as follows:

1. If $C' = C$ for some $C \in \sigma$, then
$\hat{p}(C') = \hat{p}(C)$.

2. If $C'$ dominates $C_1, ..., C_k$ and
$\hat{p}(C_1), ..., \hat{p}(C_k)$ are defined, then
$\hat{p}(C') = \sum_{i=1}^{k} \hat{p}(C_i)$.

3. If $C'$ is dominated by $C$ and if $\hat{p}(C)$ is defined, then
$\hat{p}(C') = \frac{|C'|}{|C|} \hat{p}(C)$.
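The estimates of Lemma 1 can be sketched in a few lines. The code below is a hedged illustration, not the authors' implementation: the tree, the cut $\sigma$ with its parameters, the cut $\tau$ and the counts are all hypothetical. It computes $\hat{p}(C')$ by summing the uniform per-noun shares of Eq. (3) over the leaves of $C'$, which coincides with the three inductive cases of the lemma.

```python
def leaves(tree, node):
    """Leaf nouns dominated by `node`; a node absent from `tree` is a leaf."""
    kids = tree.get(node, [])
    if not kids:
        return [node]
    return [x for k in kids for x in leaves(tree, k)]


def leaf_probs(tree, sigma_q):
    """Per-noun probabilities of the tree cut model (sigma, q), as in Eq. (3)."""
    probs = {}
    for c, q in sigma_q.items():
        members = leaves(tree, c)
        for x in members:
            probs[x] = q / len(members)
    return probs


def mle_association(tree, sigma_q, tau, noun_counts):
    """Lemma 1: A(C', v) = (#(C', S_v) / |S_v|) / p(C') for each C' in tau."""
    probs = leaf_probs(tree, sigma_q)
    size = sum(noun_counts.values())                          # |S_v|
    result = {}
    for c in tau:
        members = leaves(tree, c)
        count = sum(noun_counts.get(x, 0) for x in members)   # #(C', S_v)
        p_c = sum(probs[x] for x in members)                  # p(C'), covers cases 1-3
        result[c] = (count / size) / p_c
    return result


# Hypothetical thesaurus fragment, cut sigma for p, and a finer cut tau for A.
tree = {"ANIMAL": ["BIRD", "INSECT"],
        "BIRD": ["swallow", "crow", "eagle"],
        "INSECT": ["bug", "bee"]}
sigma_q = {"BIRD": 0.6, "INSECT": 0.4}
tau = ["swallow", "crow", "eagle", "INSECT"]
noun_counts = {"swallow": 4, "crow": 2, "eagle": 2, "bee": 2}  # sample S_v
A_hat = mle_association(tree, sigma_q, tau, noun_counts)
print(A_hat)
```

In this toy case 'swallow' falls under case 3 of the lemma (it lies below BIRD in $\sigma$) and INSECT under case 1, and the resulting pair satisfies the stochastic condition (5).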

Proof of Lemma 1

Given the tree cuts $\tau$ and $\sigma$, define
$\tau \wedge \sigma$ to be the tree cut whose noun partition equals
the coarsest partition that is finer than or equal to both the noun
partitions of $\tau$ and $\sigma$. Then, the likelihood function
$L(h, S_v)$ which we are trying to maximize (for
$h = (A, \hat{p})$) can be written as follows,

\[ L(h, S_v) = \prod_{C \in \tau \wedge \sigma} \big( A(C, v) \cdot \hat{p}(C) \big)^{\#(C, S_v)} \tag{9} \]

where $A(C, v)$ for $C \notin \tau$ and $\hat{p}(C)$ for
$C \notin \sigma$ are defined so that they are consistent⁸ with the
definitions of $A(n, v)$ and $\hat{p}(n)$. As before, since
$\hat{p}$ is fixed, the above maximization problem is equivalent to
maximizing just the product of $A$ values, namely,

\[ \arg\max_{A} L(h, S_v) = \arg\max_{A} \prod_{C \in \tau \wedge \sigma} A(C, v)^{\#(C, S_v)} \tag{10} \]

8 That is, $\hat{p}$ is defined as specified in the lemma, and $A$
is defined by inheriting the same value as the $A$ value of the
ascendant in $\tau$.

Since $\tau \wedge \sigma$ is always finer than $\tau$, for each
$C \in \tau \wedge \sigma$, there exists some $C' \in \tau$ such
that $A(C, v) = A(C', v)$. Thus,

\[ \arg\max_{A} L(h, S_v) = \arg\max_{A} \prod_{C' \in \tau} A(C', v)^{\#(C', S_v)} \tag{11} \]

Note that the maximization is subject to the condition:

\[ \sum_{n \in N} A(n, v) \cdot \hat{p}(n) = \sum_{C' \in \tau} A(C', v) \cdot \hat{p}(C') = 1. \tag{12} \]

Since multiplying by a constant leaves the argument of maximization
unchanged, (11) yields

\[ \arg\max_{A} L(h, S_v) = \arg\max_{A} \prod_{C' \in \tau} \big( A(C', v) \cdot \hat{p}(C') \big)^{\#(C', S_v)} \tag{13} \]

where the maximization is under the same condition (12).
Emphatically, the quantity being maximized in (13) is different
from the likelihood in (9), but both attain their maximum for the
same values of $A$. Thus the maximization problem is reduced to one
of the form: 'maximize $\prod_i (a_i \cdot p_i)^{k_i}$ subject to
$\sum_i a_i \cdot p_i = 1$.' As is well-known, this is given by
setting $a_i \cdot p_i = k_i / \sum_j k_j$ for each $i$. Thus, (10)
is obtained by setting, for each $C' \in \tau$,

\[ \hat{h}(C' \mid v) = \hat{A}(C', v) \cdot \hat{p}(C') = \frac{\#(C', S_v)}{|S_v|} \tag{14} \]

Hence, $\hat{A}$ is given, for each $C' \in \tau$, by

\[ \hat{A}(C', v) = \frac{\hat{h}(C' \mid v)}{\hat{p}(C')} \tag{15} \]

This completes the proof. □
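For readers who want the 'well-known' maximization step spelled out, one standard route is a Lagrange multiplier argument. The symbols $x_i$ and $\lambda$ below are notation introduced here for illustration and do not appear in the original proof.

```latex
% Write x_i = a_i p_i and take logarithms of the objective.
\max_{x} \; \sum_i k_i \log x_i
\quad \text{subject to} \quad \sum_i x_i = 1 .

% Lagrangian and stationarity condition:
\mathcal{L}(x, \lambda) = \sum_i k_i \log x_i - \lambda \Big( \sum_i x_i - 1 \Big),
\qquad
\frac{\partial \mathcal{L}}{\partial x_i} = \frac{k_i}{x_i} - \lambda = 0
\;\Longrightarrow\; x_i = \frac{k_i}{\lambda} .

% The constraint fixes the multiplier:
\sum_i \frac{k_i}{\lambda} = 1
\;\Longrightarrow\; \lambda = \sum_j k_j ,
\qquad
a_i \cdot p_i = x_i = \frac{k_i}{\sum_j k_j} .
```

With $k_i = \#(C', S_v)$ and $\sum_j k_j = |S_v|$ this is the relative-frequency estimate used in the lemma.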

We now go on to the issue of how we can find a model satisfying (8)
efficiently. This is possible with a recursive algorithm which
resembles Find-MDL of [LA95]. This algorithm works by recursively
applying itself on subtrees to obtain optimal tree cuts for each of
them, and decides whether to return a tree cut obtained by
appending all of them, or a cut consisting solely of the top node
of the current subtree, by comparison of the respective description
lengths. In calculating the data description length at each
recursive call, the formulas of Lemma 1 are used to obtain the MLE.
The details of this procedure are shown below as algorithm
'Assoc-MDL.' Note, in the algorithm description, that $S$ denotes
the input sample, which is a sequence of elements of $N \times V$.
For any fixed verb $v \in V$, $S_v$ denotes the part of $S$ that
corresponds to verb $v$, i.e.
$S_v = \{n \mid (n, v) \in S\}_M$. (We use $\{\}_M$ when denoting a
'multi-set.') We use $\pi_1(S)$ to denote the multi-set of nouns
appearing in sample $S$, i.e.
$\pi_1(S) = \{n \mid \exists v \in V\, (n, v) \in S\}_M$. In
general, $t$ stands for a node in a tree, or equivalently the class
of nouns it represents. It is initially set to the root node of the
input thesaurus tree. In general, '[...]' denotes a list.

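algorithm Assoc-MDL($t$, $S$)

1. $\hat{p}$ := Find-MDL($t$, $\pi_1(S)$)
2. $\hat{A}$ := Find-Assoc-MDL($S_v$, $t$, $\hat{p}$)
3. return(($\hat{A}$, $\hat{p}$))

sub-procedure Find-MDL($t$, $S$)

1. if $t$ is a leaf node
2. then return(([$t$], $\hat{p}(t, S)$))
3. else
4. For each child $t_i$ of $t$, $c_i$ := Find-MDL($t_i$, $S$)
5. $\gamma$ := append($c_i$)
6. if $\#(t, \pi_1(S)) \big(-\log \frac{\hat{p}(t)}{|t|}\big) + \frac{1}{2} \log N <$
   $\sum_{t_i \in children(t)} \#(t_i, \pi_1(S)) \big(-\log \frac{\hat{p}(t_i)}{|t_i|}\big) + \frac{|\gamma|}{2} \log N$
7. then return(([$t$], $\hat{p}(t, S)$))
8. else return($\gamma$)

sub-procedure Find-Assoc-MDL($S_v$, $t$, $\hat{p}$)

1. if $t$ is a leaf node
2. then return(([$t$], $\hat{A}(t, v)$))
3. else Let $\tau$ := children($t$)
4. $\hat{h}(t \mid v)$ := $\#(t, S_v) / |S_v|$
5. $\hat{A}(t, v)$ := $\hat{h}(t \mid v) / \hat{p}(t)$
/* We use definitions in Lemma 1 to calculate $\hat{p}(t)$ */
6. For each child $t_i \in \tau$ of $t$
7. $\gamma_i$ := Find-Assoc-MDL($S_v$, $t_i$, $\hat{p}$)
8. $\gamma$ := append($\gamma_i$)
9. if $\#(t, S_v) \big(-\log \hat{A}(t, v)\big) + \frac{1}{2} \log |S_v| <$
10. $\sum_{t_i \in \tau} \#(t_i, S_v) \big(-\log \hat{A}(t_i, v)\big) + \frac{|\gamma|}{2} \log |S_v|$
/* The values of $\hat{A}(t_i, v)$ used above are to be */
/* those in $\gamma_i$ */
11. then return(([$t$], $\hat{A}(t, v)$))
12. else return($\gamma$)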
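The recursion of Find-Assoc-MDL can be sketched in Python as follows. This is an illustrative reading rather than the authors' code: the function and argument names, the toy tree, the uniform noun distribution and the counts are hypothetical. It also makes one interpretive assumption, namely that the comparison in step 9 is between the description length of the one-node cut [t] and the full description length of the appended cut $\gamma$, each costed as in Eq. (8) with one parameter per class.

```python
import math


def leaves(tree, node):
    """Leaf nouns dominated by `node`; a node absent from `tree` is a leaf."""
    kids = tree.get(node, [])
    if not kids:
        return [node]
    return [x for k in kids for x in leaves(tree, k)]


def find_assoc_mdl(tree, node, noun_counts, leaf_p, size):
    """Return (cut, dl): `cut` lists (class, A value) pairs, `dl` is its cost in bits."""
    members = leaves(tree, node)
    count = sum(noun_counts.get(x, 0) for x in members)    # #(t, S_v)
    a = (count / size) / sum(leaf_p[x] for x in members)   # A(t, v) by Lemma 1
    # Cost of the one-node cut [t]: (1/2) log |S_v| plus the -log A data term.
    own = 0.5 * math.log2(size) - (count * math.log2(a) if count else 0.0)
    kids = tree.get(node, [])
    if not kids:
        return [(node, a)], own
    cut, total = [], 0.0
    for child in kids:
        sub_cut, sub_dl = find_assoc_mdl(tree, child, noun_counts, leaf_p, size)
        cut += sub_cut
        total += sub_dl
    # Keep the single node only if it is strictly cheaper than the appended cut.
    if own < total:
        return [(node, a)], own
    return cut, total


# Hypothetical data: uniform estimated noun distribution, 100 subjects of a verb.
tree = {"ANIMAL": ["BIRD", "INSECT"],
        "BIRD": ["swallow", "crow", "eagle"],
        "INSECT": ["bug", "bee"]}
leaf_p = {x: 0.2 for x in ["swallow", "crow", "eagle", "bug", "bee"]}
noun_counts = {"swallow": 30, "crow": 30, "eagle": 30, "bee": 10}
size = sum(noun_counts.values())
cut, dl = find_assoc_mdl(tree, "ANIMAL", noun_counts, leaf_p, size)
print(cut, dl)
```

On this toy input the three bird nouns share the same association value and are merged into BIRD, while 'bug' and 'bee' differ enough to be kept apart; an unobserved class simply receives the MLE value 0.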

Given Lemma 1, it is not difficult to see that Find-Assoc-MDL
indeed does find a tree cut pair model which minimizes the total
description length. Also, its running time is clearly linear in the
size (number of leaf nodes) of the thesaurus tree, and linear in
the input sample size. The following proposition summarizes these
observations.

Proposition 1 Algorithm Find-Assoc-MDL outputs
$\hat{h} \in H(\hat{p}) = \{A \cdot \hat{p} \mid A \in \mathcal{A}(t)\}$
(where $\mathcal{A}(t)$ denotes the class of association tree cut
models for thesaurus tree $t$) such that

\[ \hat{h} = \arg\min_{h \in H(\hat{p})} \Big( d.l.(h) + \sum_{n \in S_v} -\log h(n \mid v) \Big) \]

and its worst case running time is $O(|S| \cdot |t|)$, where $|S|$
is the size of the input sample, and $|t|$ is the size (number of
leaves) of the thesaurus tree.

We note that an analogous (and easier) proposition on Find-MDL is
stated in [LA95].



4 Comparison with Existing Methods 

A simpler alternative formulation of the problem of acquiring case
frame patterns is to think of it as the problem of learning the
distribution over nouns at a given case slot of a given verb, as in
[LA95]. In that paper, the algorithm Find-MDL was used to estimate
$p(n \mid v)$ for a fixed verb $v$, which is merely a distribution
over nouns. The method was guaranteed to be near-optimal as a
method of estimating the noun distribution, but it suffered from
the disadvantage that it tended to be influenced by the absolute
frequencies of the nouns. This is a direct consequence of employing
a simpler formulation of the problem, namely as that of learning a
distribution over nouns at a given case slot of a given verb, and
not an association norm between the nouns and verbs.

To illustrate this difficulty, suppose that we are given 4
occurrences of the word 'swallow,' 7 occurrences of 'crow,' and 1
occurrence of 'robin,' say at the subject position of 'fly.' The
method of [LA95] would probably conclude that 'swallow' and 'crow'
are likely to appear at subject position of 'fly,' but not 'robin.'
But the reason why the word 'robin' is not observed many times may
be attributable to the fact that this word simply has a low
absolute frequency, irrespective of the context. For example,
'swallow,' 'crow,' and 'robin' might have absolute frequencies of
42, 66, and 9, respectively, in the same data with unrestricted
contexts. In this case, their frequencies of 4, 7 and 1 as subject
of 'fly' would probably suggest that they are all roughly equally
likely to appear as subject of 'fly,' given that they do appear at
all.
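The arithmetic behind this example is easy to verify. The short sketch below uses only the counts quoted above; the variable names are hypothetical.

```python
subject_of_fly = {"swallow": 4, "crow": 7, "robin": 1}   # counts as subject of 'fly'
overall = {"swallow": 42, "crow": 66, "robin": 9}        # counts in all contexts

total_fly = sum(subject_of_fly.values())
# Share of the 'fly' subjects taken by each noun (what a noun distribution sees).
cond = {n: subject_of_fly[n] / total_fly for n in overall}
# Fraction of each noun's own occurrences that are subjects of 'fly'.
ratio = {n: subject_of_fly[n] / overall[n] for n in overall}

for n in overall:
    print(n, round(cond[n], 3), round(ratio[n], 3))
```

The conditional shares differ by a factor of seven between 'crow' and 'robin', whereas the per-noun ratios all lie close to 0.1, which is the sense in which the three nouns are 'roughly equally likely' to appear as subject of 'fly.'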



An earlier method proposed by Resnik [Res92] takes into account the
above intuition in the form of a heuristic. His method judges
whether a given noun tends to co-occur with a verb or not, based on
its super-concept having the highest value of association norm with
that verb. The association norm he used, called the 'selectional
association,' is defined, for a noun class $C$ and a verb $v$, as

\[ \sum_{n \in C} p(n \mid v) \log \frac{p(n, v)}{p(n)\, p(v)} \]



Despite its intuitive appeal, the most serious disadvantage of
Resnik's method, in our view, is the fact that no theoretical
justification is provided for employing it as an estimation method,
in contrast to the method of Li and Abe [LA95], which enjoyed
theoretical justification, if at the cost of an over-simplified
formulation. This naturally leads to the question of whether there
exists a method which estimates a reasonable notion of association
norm, and at the same time is theoretically justified as an
estimation method. This, we believe, is exactly what the method
proposed in the current paper provides.

5 Experimental Results 

5.1 Learning Word Association Norm

The training data we used were obtained from the texts of the
tagged Wall Street Journal corpus (ACL/DCI CD-ROM1), which contains
126,084 sentences. In particular, we extracted triples of the form
(verb, case_slot, noun) or (noun, case_slot, noun) using a standard
pattern matching technique. (These two types of triples can be
regarded more generally as instances of (head, case_slot,
slot_value).) The thesaurus we used is basically 'WordNet' (version
1.4) [MBF+93], but as WordNet has some anomalies which make it
deviate from the definition of a 'thesaurus tree' we had in Section
2, we needed to modify it somewhat.⁹ Figure 2 shows selected parts
of the ATCM obtained by Assoc-MDL for the direct object slot of the
verb 'buy,' as well as the TCM obtained by the method of [LA95],
i.e. by applying Find-MDL on the data for that case slot. Note that
the nodes in the TCM having probabilities less than 0.01 have been
discarded.

9 These anomalies are: (i) The structure of WordNet is in fact not
a tree but a DAG; (ii) The (leaf and internal) nodes stand for a
word sense and not a word, and thus the same word can be contained
in more than one word sense and vice versa. We refer the interested
reader to [LA95] for the modifications we made.

Figure 2: Parts of the ATCM and the TCM

We list a number of general tendencies that can be observed in
these results. First, many of the nodes that are assigned high $A$
values by the ATCM are not present in the TCM, as they have
negligible absolute frequencies. Some examples of these nodes are
⟨property, belonging...⟩, ⟨right⟩, ⟨ownership⟩, and ⟨part,...⟩. Our
intuition agrees with the judgement that they do represent suitable
direct objects of 'buy,' and the fact that they were picked up by
Assoc-MDL despite their low absolute frequencies seems to confirm
the advantage of our method. Another notable fact is that the cut
in the ATCM is always 'above' that of the TCM. For example, as we
can see in Figure 2, the four nodes ⟨action⟩, ⟨activity⟩,
⟨allotment⟩, and ⟨commerce⟩ in the TCM are all generalized as one
node ⟨act⟩ in the ATCM, reflecting the judgement that despite their
varying absolute frequencies, their association norms with 'buy' do
not significantly deviate from one another. In contrast, note that
the nodes ⟨property⟩, ⟨asset⟩, and ⟨liability⟩ are kept separate in
the ATCM, as the first two have high $A$ values, whereas
⟨liability⟩ has a low $A$ value, which is consistent with our
intuition that one does not want to buy debt.

5.2 PP-attachment Disambiguation Experiment

We used the knowledge of association norms acquired in the
experiment described above to resolve pp-attachment ambiguities.

For this experiment, we used the bracketed corpus of the Penn Tree
Bank (Wall Street Journal Corpus) [MSM93] as our data. First we
randomly selected one directory of the WSJ files containing roughly
1/26 of the entire data as our test data and what remains as the
training data. We repeated this process ten times to conduct cross
validation. At each of the ten iterations, we extracted from the
test data (verb, noun1, prep, noun2) quadruples, as well as the
'answer' for the pp-attachment site for each quadruple by
inspecting the parse trees given in the Penn Tree Bank. Then we
extracted from the training data (verb, prep, noun2) and (noun1,
prep, noun2) triples. Having done so, we preprocessed both the
training and test data by removing obviously noisy examples, and
subsequently applying 12 heuristic rules, including: (1) changing
the inflected form of a word to its stem form, (2) replacing
numerals with the word 'number,' (3) replacing integers between
1900 and 2999 with the word 'year,' etc. On average, for each
iteration we obtained 820.4 quadruples as test data, and 19739.2
triples as training data.

For the sake of comparison, we also tested the method proposed in
[LA95], as well as a method based on Resnik's [Res92]. For the
former, we used Find-MDL to learn the distribution of case_values
(nouns) at a specific case_slot of a specific head (a noun or a
verb), and used the acquired conditional probability distribution
$p_{head}(case\_value \mid case\_slot)$ to disambiguate the test
patterns. For the latter, we generalized each case_value at a
specific case_slot of a specific head to the appropriate level in
WordNet using the 'selectional association' (SA) measure, and used
the SA values of those generalized classes for disambiguation.¹⁰

10 Resnik actually generalizes both the heads and the case_values,
but here we only generalize case_values to allow a fair comparison.

More concretely, for a given test pattern (verb, noun1, prep,
noun2), our method compares $\hat{A}_{prep}(noun_2, verb)$ and
$\hat{A}_{prep}(noun_2, noun_1)$, and attaches (prep, noun2) to verb
or noun1 depending on which is larger. If they are equal, then it
is judged that no decision can be made. Disambiguation using SA is
done in a similar manner, by comparing the two corresponding SA
values, while that by Find-MDL is done by comparing the conditional
probabilities, $\hat{P}_{prep}(noun_2 \mid verb)$ and
$\hat{P}_{prep}(noun_2 \mid noun_1)$.

Table 1: Results of PP-attachment disambiguation

Method     Coverage (%)   Accuracy (%)
Default    100            70.2
MDL        73.3           94.6
SA         63.7           94.3
Assoc      80.0           95.2

Table 1 shows the results of this pp-attachment disambiguation
experiment in terms of 'coverage' and 'accuracy.' Here 'coverage'
refers to the percentage of the test patterns for which the
disambiguation method made a decision, and 'accuracy' refers to the
percentage of those decisions that were correct. In the table,
'Default' refers to the method of always attaching (prep, noun2) to
noun1, and 'Assoc,' 'SA,' and 'MDL' stand for using Assoc-MDL,
selectional association, and Find-MDL, respectively. The tendency
of these results is clear: In terms of prediction accuracy, Assoc
remains essentially unchanged from both SA and MDL at about 95 per
cent. In terms of coverage, however, Assoc, at 80.0 per cent,
significantly out-performs both SA and MDL, which are at 63.7 per
cent and 73.3 per cent, respectively.

Figure 3 plots the 'coverage-accuracy' curves for all three
methods. The x-axis is the coverage (as a ratio, not a percentage)
and the y-axis is the accuracy. These curves are obtained by
employing a 'confidence test'¹¹ for judging whether to make a
decision or not, and then changing the threshold confidence level
as parameter. It can be seen that overall Assoc enjoys a higher
coverage than the other two methods, since its accuracy does not
drop nearly as sharply as that of the other two methods as the
required confidence level approaches zero. Note that ultimately
what matters the most is the performance at the 'break-even' point,
namely the point at which the accuracy equals the coverage, since
it achieves the optimal accuracy overall. It is quite clear from
these curves that Assoc will win out there. The fact that Assoc
appears to do better than MDL confirms our intuition that the
association norm is better suited for the purpose of disambiguation
than the conditional probability. The fact that Assoc out-performs
SA, on the other hand, confirms that our estimation method for the
association norm based on MDL is not only theoretically sound but
excels in practice, as SA is a heuristic method based on
essentially the same notion of association norm.

11 We perform the following heuristic confidence test to judge
whether a decision can be made. We divide the difference between
the two estimates by the approximate standard deviation of that
difference, heuristically calculated by
$\sqrt{\sigma_1^2 / N_1 + \sigma_2^2 / N_2}$, where $\sigma_i^2$ is
the variance of the association values for the classes in the tree
cut output for the head and prep in question, and $N_i$ is the size
of the corresponding sub-sample. (The test is simpler for MDL.)

Figure 3: The coverage-accuracy curves for MDL, SA and Assoc.
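The decision rule of Section 5.2 and the two evaluation measures of Table 1 can be sketched as follows. This is a minimal illustration: the function names and the four score pairs are hypothetical, and the confidence test of footnote 11 is deliberately omitted, so a decision is withheld only when the two association values are equal.

```python
def attach(a_verb, a_noun):
    """Attach (prep, noun2) to whichever head has the larger association value."""
    if a_verb > a_noun:
        return "verb"
    if a_noun > a_verb:
        return "noun"
    return None  # equal values: no decision is made


def coverage_accuracy(decisions, answers):
    """Coverage: share of patterns decided; accuracy: share of decisions correct."""
    made = [(d, a) for d, a in zip(decisions, answers) if d is not None]
    coverage = len(made) / len(decisions)
    accuracy = sum(d == a for d, a in made) / len(made) if made else 0.0
    return coverage, accuracy


# Hypothetical association values (verb head, noun1 head) for four test patterns.
scores = [(2.0, 0.5), (0.3, 1.2), (1.0, 1.0), (0.8, 0.4)]
answers = ["verb", "noun", "noun", "noun"]
decisions = [attach(av, an) for av, an in scores]
coverage, accuracy = coverage_accuracy(decisions, answers)
print(decisions, coverage, accuracy)
```

Because accuracy is computed only over the patterns for which a decision was made, a method can trade coverage for accuracy, which is what the coverage-accuracy curves of Figure 3 explore.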



6 Concluding Remarks 

We have proposed a new method of learning the 'association norm'
$A(x, y) = p(x, y) / p(x)p(y)$ between two discrete random
variables. We applied our method to the important problem of
learning word association norms from large corpus data, using the
class of 'tree cut pair models' as the knowledge representation
language. A syntactic disambiguation experiment conducted using the
acquired knowledge shows that our method improves upon other
methods known in the literature for the same task. In the future,
we hope to demonstrate that the proposed method can be used in
practice, by testing it on even larger corpus data.